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A molecular understanding of how protein function is related to protein structure will require an 
ability to understand large conformational changes between multiple states. Unfortunately these 
states are often separated by high free energy barriers and within a complex energy landscape. This 
makes it very difficult to reliably connect, for example by all- atom molecular dynamics calculations, 
the states, their energies and the pathways between them. A major issue needed to improve sampling 
on the intermediate states is an order parameter - a reduced descriptor for the major subset of 
degrees of freedom - that can be used to aid sampling for the large conformational change. We 
present a novel way to combine information from molecular dynamics using non-linear time series and 
dimensionality reduction, in order to quantitatively determine an order parameter connecting two 
large-scale conformationally distinct protein states. This new method suggests an implementation 
for molecular dynamics calculations that may dramatically enhance sampling of intermediate states. 



INTRODUCTION 

Proteins represent complex dynamical systems with 
many different stable states. Sampling on these systems 
has been termed the multiple time scale problem [1 with 
many different wells and barrier heights being found. De- 
spite considerable effort, there is currently no rapid way 
to determine the range of stable states from a single struc- 
ture or from sequence alone. Because biological function 
is intrinsically linked to the large scale conformational 
change of proteins, an improved understanding of how 
conformational change in the complex energy landscape 
of the protein is determined will have an immediate im- 
pact on biological and physical questions. 

One method that has been used exhaustively for at- 
tempting to determine reduced descriptors for protein 
motions has been through the concepts of normal mode 
analysis pHS] . If the potential function driving the ener- 
gies was fully harmonic, then the solution would be an- 
alytic and all the stable states and their motions would 
be immediately determined. But, with thermal motions 
driving non-harmonic interactions such as van der Waals 
and electrostatic forces, the motions are controlled by 
a very complex, but physically defined, energy surface. 
The current paradigm for interpreting these motions is to 
run the longest possible molecular dynamics calculations 
and then to use the fluctuations seen in the trajectory 
to define a set of effective normal modes. These normal 
modes have been called the principal normal modes when 
they are restricted to the lowest collective degrees of free- 
dom. Several groups have suggested that following these 
normal modes could be used as an effective order param- 
eter for controlling large conformational change [H HI |7]- 
Nevertheless, the approach has not led to great success, 
with the modes often leading to blind alleys in the surface 
that don't reflect large conformational change jSHTT]. 

To improve on the sampling of large-scale conforma- 
tional change there have been many methods proposed. 



The most well known contemporary method has been 
called transition path sampling and uses a small number 
of conformations along a candidate transition pathway 
along with a Monte Carlo move set to try and anneal an 
optimal prediction of intermediates in a conformationally 
changing system [121113]. Alternative methods have used 
the RMS differences between states as an order param- 
eter to control change [14 , directly adding a new force 
that biases motion along the RMS gradient. We have de- 
veloped a method, called dynamic importance sampling 
(DIMS) p!5HT9] that uses concepts from stochastic differ- 
ential equations [20 to create a family of independent 
transitions that together define the likelihood of different 
pathways and the kinetics of the transition with sufficient 
sampling. However, similar to the RMS based methods, 
the DIMS method requires a progress variable to gauge 
progress of the transitions and to create the biasing and 
its correction for an unbiased estimate of pathways, ki- 
netics and states. 

In this contribution we describe the use of effective 
transfer entropy for the determination of a reduced set 
of degrees of freedom that can be used to define order pa- 
rameters behind large scale conformational change. Our 
approach combines insights from the physics of non-linear 
time-series analysis, dimensionality reduction, and the 
chemical physics of protein motions on a complex energy 
surface to enable the dynamics of the complex system to 
define an order parameter candidate. This improves on 
other methods for the determination of order parameters 
where the candidate order parameter was inferred from 
empirical analysis of the static structure or simply as- 
sumed to correlate with the RMS between two different 
states. In the calculations to follow we mainly use the 
receiver domain of nitrogen regulatory protein C (NtrC) 
(Fig. [T]), in addition, we have performed steered molec- 
ular dynamics and checks of the implementation on the 
glucose-galactose binding protein (GGBP). 
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PRINCIPAL COMPONENT ANALYSIS 

The method of principal component analysis has been 
used in the analysis of protein motions for many years. 
This approach depends on the determination of a set of 
effective harmonic modes that define the complex mo- 
tions that have been seen in the dynamics. While the 
initial excitement over the method as a way to obtain 
longer time-scales seems to have faded, there remains 
much effort to use this approach as a tool for the analy- 
sis of conformational change. Recent work has suggested 
that this approach may lead to significant systematic er- 
ror when there are multiple stable states separated by a 
large barrier. 

To compute the normal modes a molecular dynamics 
(MD) trajectory is used along with the determination of 
the average fiuctuations in the simulation. Then, from 
the MD trajectory q{t) = {qiit) , q2{t) , . . . , q^N if))) of a 
protein with TV atoms, the correlation matrix cr is built 
as follows: 

'y = {m)-{m))m)-m))))- a) 

Where the brackets ((...)) denote time averages. The 
orthonormal basis vectors (principal components/PC) fjcx 
are determined by the eigenvalue problem \cxf]a = cr- 

The lowest frequency modes from PGA are normally 
associated with slow, collective motions and have been 
used to try and predict intermediate states. Figure [l] 
depicts the lowest frequency mode obtained by applying 
equation ([T]) and, solving the eigenvalue problem for our 
600 ns trajectory of NtrC. On this plot the porcupine 
spines are located at the Cq, atoms and their magnitude 
and direction shows the type of motion involved in the 
mode. 

For a given mode a, the involvement coefficients (IC) 
are defined as: 

Vo. = \\ria-{q^-e)l (2) 

where g^'^ indicates the set of normalized coordinates 
{q^'^ • g^'^ = 1) that represent the active-state and 
inactive- state conformations, respectively. Therefore, the 
ICs measure the amount of overlap between a PC and 
the direction defined by the displacement vector between 
structures. As mentioned in the case of hinge-bending 
motions, PC A shows higher values for the ICs compared 
to those from more complex motions. For instance, in the 
case of Adenalyte Kinase (AdK), the ICs for the first two 
modes are 0.49 and 0.63, respectively. In the case of NtrC 
the ICs are much lower (Fig. [4|, in consequence the direc- 
tion of the first PCs of both stable states are not point- 
ing directly towards the other end state. In contrast, 
in another study, by using an empirical order parameter 
(based on observations of both stable structures) yields 




FIG. 1. Lowest frequency mode from PGA analysis for the 
inactive-state of NtrG . 

higher ICs values [21]. These order parameters involve 
only localized regions of the system and are proposed in 
an orderly series of events. This result would imply that 
the PCs are not pointing directly towards the other end 
state but towards a different intermediate state. 

One of the ideas behind an order parameter is that 
a few degrees of freedom dominate part of or the en- 
tire transition, while the rest of the system would fol- 
low. Therefore finding an order parameter is equivalent 
to locating such leading modes. In this paper we use an 
information theoretical approach to identify the leading 
modes by measuring the transfer entropy between pairs 
of residues. The more dominant residues are those that 
transfer the largest amount of entropy to the rest of the 
system. 

INFORMATION FLOW IN PROTEINS 

The networks of interactions between atoms and 
residues define the web of dependencies and patterns of 
dynamic coupling between domains in a protein, char- 
acterized by the directed fiow of information spanning 
multiple spatial and temporal scales. An initial appli- 
cation of transfer entropy to DNA binding proteins was 
the first to apply the asymmetry of information transfer 
to protein molecular motions [22]. Let X be the time se- 
ries for the center of mass of the ith residue and, p{X) 
its probability distribution. Therefore it is possible to 
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measure the average number of bits needed to optimally 
encode independent draws by using the Shannon entropy 
Hx = — Xla^P^c logp(x)[23l [24], where the sum extends 
over all the states that X can reach. 

Transfer entropies 

For a residue j ^ i with a center of mass Y and, proba- 
bility distribution p{Y); one could say that its trajectory 
is independent of that of residue i if 




), (3) 

where p{yn-\-i\yn) is the conditional probability to 
find residue j at state ^n+i given its past 
and, p{yn-\-i\yn^Xn) is the conditional probability to find 
residue j at state yn-\-i given the past of both i and j. In 
the case where there is not a fiux of information from X 
to Y then equation ([3| is correct. On the other hand, in 
the event that there is fiux of information in any direc- 
tion, the divergence from correctness of equation (|3| can 
be quantified by the Kullback-Leibler entropy [25 hence 
defining the transfer entropy [26j: 

T.^y = j:p(yn..,yn,x^)io, ^^^^^^^^^^ . (4) 

The transfer entropy between i and j is minimum and 
equal to zero when the two residues are independent and 
there is a maximum and equal to the entropy rate: 

hy = -^K^n+l,yn)logK^n+l|yn), (5) 

when the residues are completely coupled. In order 
to minimize artifacts within the time series we use the 
normalized effective transfer entropy given by [27l [28] : 

(6) 

where the second term is the average transfer entropy 
from A/trials surrogated samples of X, to F. 

The set F of most dominant residues 

The total fiux between two residues X and F, can be 
calculated by the equation. 

Residues are selected according to the following rules: i 
is selected if Dx^y > 0, residue j is selected if Dx^y < 



FIG. 2. A pulling force is applied along the line defined by 
residues Phe:142 and Leu: 144 (highlighted in green). The 
arrows in red represent the modes determined from a PGA of 
the pulling trajectory. 

and, if Dx^y = then no residue is selected. The set 
of most dominant residues F is then defined as the set 
of residues that follow the rules above and also that are 
above a fixed cutoff \Dx^y\ ^ ^cutoff- 

EXPERIMENTS WITH GGBP 

To verify that our implementation was correct, we per- 
formed analysis of coupled chaotic Ulam maps, for Henon 
maps and for autoregressive processes. In addition, as a 
more challenging test case, we used the Glucose-galactose 
binding protein (GGBP) [29 . The two domains of GGBP 
exhibit a 0.5 rad hinge opening motion from one state to 
the other. The structure of the open state for an unbound 
glucose-galactose binding protein (GGBP) was crystal- 
lized by Borrock et al. (PDBID:2FW0)[29 at 1.55 A. 
For the purpose of testing we used both DIMS transitions 
and we applied a constant pulling force along the line de- 
termined by residues Phe:142 and Leu: 144 to create a 
system with a known directional change (highlighted in 
green in Figure [2| . 

As a comparison point, we performed a PC A analy- 
sis over the trajectory generated by pulling along the 
residues Phe:142 and Leu: 144. With the transient na- 
ture of the pulling, it can be seen how PGA is unable to 
detect the pulling dir ect ion ( Figure [2|. We now describe 
the data treatment and some results from our initial test- 
ing for the transfer entropy analysis that we propose. 

Time series treatment 

The time series from MD describing the atomic mo- 
tions of proteins are real valued. Estimation of the joint 
probability densities in equation Q, from real valued 
data is not only computationally expensive but unneces- 
sary. It has been shown that the amplitude of collective 
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excitations, representing correlated global motions in the 
protein, samples multicentered distributions ^ . There- 
fore, although single or double precision arithmetic is nec- 
essary for the stability and accuracy of the simulations, 
if done carefully, discretization or coarse-graining of the 
data does not affect these distributions while greatly re- 
ducing noise and increasing computational efficiency. We 
optimize our implementation by incorporating high per- 
formance computing techniques (massively parallel calcu- 
lations extended over thousands of cores) and by apply- 
ing dimensionality reduction and data mining techniques 
that we briefly describe in the following sections. 

In other applications of transfer entropies [22[ |27l [28l 
[3TJ [32] discretization of the data is performed mainly 
by using symbolization techniques. In some cases the 
discretization is so severe that all data is mapped to single 
bit time series (spikes), as is the case to analyse data from 
neurophysiological data in epilepsy patients. 

Piecewise Aggregate Approximation (PAA) 

A time series q(t) = {qi(t), q2(t), • • • , QsNit))) of length 
n can be represented by a second time series Q{t^) = 
{Qi(t'),Q2{t'),...,Q2,N{t'))) of length w < n, where 
each element Q{t') is computed according to[33j: 

Q{t') = ^ mdt, (8) 

where At = n/w. In other words, each vector of the 
time series Q{t') is simply the average, over a time range 
At, of the time series q{t). When At is constant, PAA can 
be seen as an attempt to approximate the original time 
series with a series of linear functions. Other approaches 
of PAA include using an adaptive mechanism to adjust 
At according to certain rules, i.e. defining a threshold 
such that a{t = T) < {q{t) - {q{t))t=i...T)t=i...T' For ah 
calculations we set the time range At = 0.1 ns. 
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FIG. 3. Transfer entropies computed for DIMS trajectories 
for GGBP. 

FINDING THE LEADING MODES ON NTRC 

The structures of the inactive-state and active-state 
conformations of NtrC have been solved by NMR[36l- 
i38j . At room temperature NtrC samples both conforma- 
tional states, however after phosphorylation the active 
states dominate the ensemble set of populations. Recent 
studies suggest that the transition pathway between the 
two conformations can be decomposed in a series of seg- 
mented progress variables (order parameters) pT] . For 
this study both states were solvated in box of dimen- 
sions 20 A X 20 A X 20 A with TIP3 waters, equilibrated 
for 15 ns; the total number of atoms, including solvent 
and ions, is 12168 and 13688 for the active and inactive 
sates respectively. Production runs were performed for 
600 ns using NAMD2.7b2[39 at NICS-Kraken. Analy- 
sis of the trajectories were executed using our code at 
NCSA- Abe/Lincoln. 



Transfer entropies from DIMS trajectories 

In a previous work we generated a set of transitions 
for GGBP [I9];the simulations were carried out using 
CHARMM27FF with CMAP with our implementa- 
tion of DIMS and using an implicit solvent (ACE2)[35 . 
The rotational and translational degrees of freedom were 
removed by rms fitting the target structure to the evolv- 
ing system and, the alignment atoms were selected on 
the N- terminal domain (Residues 111 to 252 and, 293 to 
305). By applying our transfer entropy analysis we were 
able to identify the key residues in the DIMS transition 
(Figure [3|. The results show that the leading residues 
for the transition are located in the three-segment hinge 
that connects the N- and C- termini [31 



Computing the modes 

A key insight is that the atoms with the strongest lead- 
ing effective transfer entropy can be used as a subset of 
degrees of freedom to define collective modes that are new 
candidate order parameters. To accomplish this goal, 
once a cutoff and a time-length for the interrogation of 
the dynamics has been defined, is straightforward. The 
modes are determined by fluctuations of the leading ef- 
fective transfer components and together describe a set 
of collective motions. 

For the residues in the set F we compute the corre- 
lation matrix as in equation ([T]) over the full trajectory 
and obtain a set of modes ff^. The involvement coef- 
ficients (equation ([2|) for different values of the cutoff 
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FIG. 4. Involvement coefficients for tlie two states of NtrC 
for different cutoffs i^cutoff- 

^cutoff are presented in Figure [4j As the cutoff increases 
fewer residues are selected as dominant, fiowever tfie in- 
volvement coefficients are clearly increasing. Tfiis sug- 
gests tfiat tfie most dominant modes rfj are pointing to- 
wards tfie end structure. Since tfie modes are transferring 
entropy to tfie entire system biasing along tfiese modes 
would result in a collective bias for the entire system. 

Since rja is an orthonormal base we can define the cu- 
mulative involvement coefficient of the first a PCs 
as: 

a 

and measure how much of the overall difference is ac- 
counted by the first a modes. 

This last figure suggests that relatively short molecular 
dynamics simulations are converging onto the important 
degrees of freedom determined by the effective transfer 
entropy analysis. It suggests that an algorithm for the 
use of the effective transfer entropy modes can be readily 
defined in CHARMM or other computer code. In that 
algorithm the lowest frequency modes would be the direc- 
tion of biasing that is applied through DIMS or another 
approach (e.g. transition path sampling or targeted MD). 
The modes would be defined by a relatively short unbi- 
ased simulation and then followed by biasing for a similar 
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FIG. 5. Cumulative involvement coefficient as a function of 
time(ns) for the first a — 20 modes. 

amount of time to the mode determination. For exam- 
ple, this figure would suggest that 5 ns of sampling for 
the effective modes followed by 5 ns of sampling along the 
modes could be used to improve the confidence that the 
most important intermediate states are being reached. 
This would then be repeated with unbiased sampling in- 
cluding light restraints on the backbone atoms to define 
a new set of effective transfer entropy modes. By contin- 
uing this process until the end state is reached, a tran- 
sition pathway would be defined. If this process is then 
repeated for multiple starting points with various sam- 
pling windows and different random number seeds, along 
with a random selection of cutoffs and mode selections, 
then a good sampling of the intermediate space should 
be obtained. We expect to implement this type of algo- 
rithm in the future and the results will be presented at 
that time. 



CONCLUSIONS 

A molecular understanding of how protein function is 
related to protein structure will require an ability to un- 
derstand large conformational changes between multiple 
states. Unfortunately these states are often separated 
by high free energy barriers and within a complex en- 
ergy landscape. This makes it very difficult to reliably 
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connect, for example, by all- atom molecular dynamics 
calculations, the states, their energies and the pathways 
between them. A major issue needed to improve sam- 
pling on the intermediate states is an order parameter 
- a reduced descriptor for the major subset of degrees 
of freedom - that can be used to aid sampling for the 
large conformational change. In this paper we present 
a novel way to combine information from molecular dy- 
namics using non-linear time series and dimensionality 
reduction, in order to quantitatively determine an or- 
der parameter connecting two large-scale conformation- 
ally distinct protein states. The results presented show 
that the leading modes can be computed from short sim- 
ulations. This new method suggests an implementation 
for molecular dynamics calculations that may dramati- 
cally enhance sampling of intermediate states. 



[1] B. Berne, Multiple Time Scales, edited by J. Brackbill 
and B. Cohen (Academic Press, 1985) p. 419. 

[2] O. M. Florence Tama, Florent Xavier Gadea and Y.-H. 

S anej ouand , Prote ins: Structure, Function, and Genetics] 

[~41, 1 (2000) ^ 

[3] W. Zhenggo and B. R. Brooks, Biophysical Journal 88, 
[ 3109 (2005) 

[4] D. M. van Aalten, A. Amadei, A. B. Linssen, V. G. Ei- 

^jsink, G. Vriend, and H. J. Berendsen, Proteins 22, 45| 

|~~(1995) 

[5] S. Hay ward, A. Kitao, and H. J. Berendsen, Proteins 

27, 425 (1997). 
[6] L. Skjaerven, B. Grant, A. Muga, K. Teigen, J. A. Mc- 

Cammon, N. Renter, a nd A. Martinez, PLoS Comput] 

Biol 7, el002004 (20Tl)| 
[7J G. V. Miloshevsky, A. Hassanein, and P. C. Jordan, 

Biophys J 98, 999 (2010) 
[8] M. A. Balsera, W. Wriggers, Y. Oono, and K. Schulten, 

Journal of Physical Chemistry 100, 2567 (1996) 
[9] J. Ma, St ructure 13, 373 (2005) | 



[10] P. Petrone and V. S. Pande, 'Bio phys J 90, 1583 (2006 
[11] A. Ramanathan and P. K. Agarwal,|J Phys Chem B 113^| 
[ 16669 (2009) 

[12] C. Dellago, P. G. Bolhuis, F. S. Csajka, and D. Chandler, 

The Journal of Chemical Physics 108, 1964 (1998) 
[13 ] P. G. Bolhuis, C. Dellago, and D. Chandler, Proc Natl| 
Acad Sci U S A 97, 5877 (2000)] 



[14] P. Maragakis and M. Karplus, Journal of Molecular Bi- 
[ ology 352, 807 (2005) 

[15] T. Woolf, Chemical Physics Letters 289, 433 (1998) 
[16] D. M. Zu ckerman and T. B. Woolf, Physical Review E, 
\ 63, 016702 (2000)1 ' ' 



D. M. Zuckerman and T. B. Woolf, The Journal of Chem^ 
ical Physics 111, 9475 (1999) 

D. M. Zuckerman and T. B. Woolf, The Journal of Chem- 
ical Physics 116, 2586 (2002). 

J. R. Perilla, O. Beckstein, E. J. Denning, and T. B. 
Woolf, Jou rnal of C omputa tional Ch emistry 32, 196| 
X2010) 



W . Wagner, [Journal of Computational Physics 71, 21 
(1987) 



M. Lei, J. Velos, A. Gardino, A. Kivenson, M. Karplus, 
and D. Kern, [Journal of Molecular Biology 392, 823] 
(2009) 



H. Kamberaj and A. van der Vaart, [Biophysical Journal 
97, 1747 (20091 



T\ M! Reza, jAn introduc tion to information theory 

Vol. 44 (Dover PublicationsT 1994) pr671679_^ 

C. E. Shannon, MD computing computers in medical 

practice 27, 306 (1948) ^^^^^^ 

S. Kullback and R. A. Leibler, The Annals of Mathemat- 
ical Statistics 22, 79 (1951) 

T. Schreiber, Physical Review Letters 85, 461 (2000). 
B. Gourevitch and J. J. Eggermont, [Journal of Neuro- 
physiology 97, 2533 (2007) " 
R. Marschinski and H. Kantz, European Physical Journal 
B 30, 275 (2002). 



[29] J.J. Borrok, L. L. Kiessling, and K. T. Forest, [Protein 



science : a publication of the Protein Society 16, 1032 



(2007) [ 



A. E. Garcia, Physical Review Letters 68, 2 696 (1992). 



[32^ 



[33] 



M. Lungarella, A. Pitti, and Y. Kuniyoshi, Physical Re- 
_view E 76, 056117 (2007) 

M. Staniek and K. Lehnertz, [Physical Review Letters 
100, 158101 (200^87 



[34] 



M 

[36] 
[37] 



[39] 



J. Lin, E. Keogh, S. Lonardi, and B. Chiu, [Proceedings 
_of the 8th ACM SIGMOD workshop on Research issues 
in data mining and knowledge discovery DMKD 03 , 2 
(2003) 

B. Brooks, C. Brooks, A. Mackerell, L. Nilsson, R. Pe- 
trella, B. Roux, Y. Won, G. Archontis, C. Bartels, 
S. Boresch, A. Caflisch, L. Caves, Q. Cui, A. Din- 
ner, M. Feig, S. Fischer, J. Gao, M. Hodoscek, W. Im, 
K. Kuczera, T. Lazaridis, J. Ma, V. Ovchinnikov, E. Paci, 
R. Pastor, C. Post, J. Pu, M. Schaefer, B. Tidor, R. Ven- 
able, H. Woodcock, X. Wu, W. Yang, D. York, and 
M . Karplus, Journal of computational chemistry 30, 1545| 
^2009) 

M . Schaefer, C. Bartels, and M. Karplus, Journal of 
molecular biology 284, 835 (1998) 

D. Kern, B. F. Volkman, P. Luginbiihl, M. J. Nohaile, 
S. Kustu, and D. E. Wemmer, Nature 402, 894 (1999) 

B. F. Volkman, D. Lipson, D. E. Wemmer, and D. Kern, 
'Science 291 , 2429 (2001), 

C. A. Hastings, S.-Y. Lee, H. S. Cho, D. Yan, S. Kustu, 
and D. E. Wemmer, Biochemistry 42, 9081 (2003) 

J. C. Phillips, R. Braun, W. Wang, J. Gumbart, 

E. Tajkhorshid, E. Villa, C. Chipot, R. D. Skeel, L. Kale, 
and K. Schulten, J Comp Chem 26, 1781 (2005), 



